
    Surprises in High-Dimensional Ridgeless Least Squares Interpolation

    Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ R^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ R^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ R^d, W ∈ R^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
    Comment: 68 pages; 16 figures. This revision contains non-asymptotic version of earlier results, and results for general coefficient
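The ridgeless estimator the abstract describes can be sketched in a few lines: in the overparametrized regime (p > n) the minimum ℓ2-norm interpolator is given by the Moore–Penrose pseudoinverse, X⁺y. The following is a minimal illustration under simplifying assumptions (identity covariance Σ, Gaussian features), not the paper's full setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # overparametrized: p > n
Z = rng.standard_normal((n, p))
Sigma_half = np.eye(p)               # identity covariance for simplicity
X = Z @ Sigma_half                   # x_i = Sigma^{1/2} z_i
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

# Minimum l2-norm interpolator: beta_hat = X^+ y (Moore-Penrose pseudoinverse).
beta_hat = np.linalg.pinv(X) @ y

# It interpolates: zero training error up to numerical tolerance.
assert np.allclose(X @ beta_hat, y)

# Any other interpolator differs by a null-space vector and has larger norm,
# since beta_hat lies in the row space of X.
v = (np.eye(p) - np.linalg.pinv(X) @ X) @ rng.standard_normal(p)
assert np.linalg.norm(beta_hat) <= np.linalg.norm(beta_hat + v) + 1e-10
```

Sweeping p/n for such a fit and plotting test risk is one way to reproduce the double-descent curve the paper analyzes.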

    Elastic Net Regularization Paths for All Generalized Linear Models

    The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression, and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
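The algorithmic core referenced here is cyclic coordinate descent on the elastic net objective (1/2n)||y − Xb||² + λ(α||b||₁ + (1−α)/2 ||b||₂²). A bare-bones sketch (assuming standardized columns; the actual glmnet implementation adds warm starts, screening, and GLM working responses):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha=0.5, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1
    + (1-alpha)/2*||b||_2^2), assuming columns of X are standardized
    (mean zero, x_j^T x_j = n)."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                            # current residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            rho = (X[:, j] @ r) / n + b[j]  # partial-residual correlation
            b_new = soft_threshold(rho, lam * alpha) / (1 + lam * (1 - alpha))
            r -= X[:, j] * (b_new - b[j])
            b[j] = b_new
    return b

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)              # standardize columns
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)
# A (cold-started) regularization path over decreasing lambda.
path = [elastic_net_cd(X, y, lam) for lam in [1.0, 0.1, 0.01]]
```

In practice the path is computed with warm starts, solving for each λ starting from the previous solution, which is what makes the full path nearly as cheap as a single fit.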

    Strong rules for discarding predictors in lasso-type problems

    We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemented with simple checks of the Karush–Kuhn–Tucker (KKT) conditions to provide safe rules that offer substantial speed and space savings in a variety of statistical convex optimization problems.
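For the plain lasso (1/2)||y − Xb||² + λ||b||₁, the basic strong rule takes a simple form: discard predictor j when |x_jᵀy| < 2λ − λ_max, where λ_max = max_j |x_jᵀy| is the smallest λ giving the all-zero solution. A hedged sketch (the paper's sequential rule, applied along a λ path, uses residuals at the previous λ instead of y):

```python
import numpy as np

def strong_rule_discard(X, y, lam, lam_max=None):
    """Basic strong rule for the lasso (1/2)||y - Xb||^2 + lam*||b||_1:
    flag predictor j as probably inactive at lam when |x_j^T y| < 2*lam - lam_max.
    Not safe in general, so the fitted solution should be confirmed with a
    KKT check, refitting with any violators added back."""
    scores = np.abs(X.T @ y)
    if lam_max is None:
        lam_max = scores.max()   # smallest lam with the all-zero solution
    return scores < 2 * lam - lam_max

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = 3.0 * X[:, 0] + rng.standard_normal(100)
lam_max = np.abs(X.T @ y).max()
discard = strong_rule_discard(X, y, 0.9 * lam_max, lam_max)
# The strongly correlated predictor 0 survives; most others are screened out.
```

Fitting only on the surviving columns, then verifying the KKT conditions |x_jᵀ(y − Xb̂)| ≤ λ on the discarded ones, is what turns this fast-but-unsafe screen into a safe procedure.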

    Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation

    Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, a model distillation and comparison approach to audit such models. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by black-box models. We compare the student model trained with distillation to a second un-distilled transparent model trained on ground-truth outcomes, and use differences between the two models to gain insight into the black-box model. Our approach can be applied in a realistic setting, without probing the black-box model API. We demonstrate the approach on four public data sets: COMPAS, Stop-and-Frisk, Chicago Police, and Lending Club. We also propose a statistical test to determine if a data set is missing key features used to train the black-box model. Our test finds that the ProPublica data is likely missing key feature(s) used in COMPAS.
    Comment: Camera-ready version for AAAI/ACM AIES 2018. Data and pseudocode at https://github.com/shftan/auditblackbox. Previously titled "Detecting Bias in Black-Box Models Using Transparent Model Distillation". A short version was presented at NIPS 2017 Symposium on Interpretable Machine Learning
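The two-student comparison can be sketched with toy data. Here the "black box" is a hypothetical scorer we only observe through its outputs, and both students are simple linear models (the paper uses more expressive transparent models); the per-feature gap between the two students is the kind of signal an audit would examine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.standard_normal((n, d))

# Hypothetical black-box risk scorer: observable only through its scores.
def black_box_score(X):
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] ** 2

scores = black_box_score(X)
outcomes = scores + rng.standard_normal(n)   # noisy ground-truth proxy

A = np.column_stack([X, np.ones(n)])         # linear students with intercept
student_distilled, *_ = np.linalg.lstsq(A, scores, rcond=None)   # mimics teacher
student_outcome, *_ = np.linalg.lstsq(A, outcomes, rcond=None)   # fits outcomes

# Where the distilled student departs from the outcome-trained student,
# the black box may be using a feature differently than the ground truth supports.
gap = student_distilled[:d] - student_outcome[:d]
```

Because both students share one transparent model class, their coefficient-level differences are directly comparable, which is the point of the distill-and-compare design.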

    Comparing spatial patterns of marine vessels between vessel-tracking data and satellite imagery

    Monitoring marine use is essential to effective management but is extremely challenging, particularly where capacity and resources are limited. To overcome these limitations, satellite imagery has emerged as a promising tool for monitoring marine vessel activities that are difficult to observe through publicly available vessel-tracking data. However, the broader use of satellite imagery is hindered by the lack of a clear understanding of where and when it would bring novel information to existing vessel-tracking data. Here, we outline an analytical framework to (1) automatically detect marine vessels in optical satellite imagery using deep learning and (2) statistically contrast geospatial distributions of vessels with the vessel-tracking data. As a proof of concept, we applied our framework to the coastal regions of Peru, where vessels without the Automatic Information System (AIS) are prevalent. Quantifying differences in spatial information between disparate datasets—satellite imagery and vessel-tracking data—offers insight into the biases of each dataset and the potential for additional knowledge through data integration. Our study lays the foundation for understanding how satellite imagery can complement existing vessel-tracking data to improve marine oversight and due diligence.

    Gene Expression Programs of Human Smooth Muscle Cells: Tissue-Specific Differentiation and Prognostic Significance in Breast Cancers

    Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.